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ABSTRACT 

In a state-update protocol for a system of L asynchronous parallel processes that 
communicate only with nearest neighbors, global desynchronization in operation 
times can be deduced from kinetic roughening of the corresponding virtual-time 
horizon (VTH). The utilization of the parallel processing environment can be de- 
duced by analyzing the microscopic structure of the VTH. In this chapter we give an 
overview of how the methods of non-equilibrium surface growth (physics of complex 
systems) can be applied to uncover some properties of state update algorithms used 
in distributed parallel discrete-event simulations (PDES). In particular, we focus on 
the asynchronous conservative PDES algorithm in a ring communication topology. 
The time evolution of its VTH is simulated numerically as asynchronous cellular 
automaton whose update rule corresponds to the update rule followed by this algo- 
rithm. There are two cases of a balanced load considered: (1) the case of the minimal 
load per processor, which is expected to produce the lowest utilization (the so-called 
worst-case performance scenario); and, (2) the case of a general finite load per pro- 
cessor. In both cases, we give theoretical estimates of the performance as a function 
of L and the load per processor, i.e., approximate formulas for the utilization (thus, 
the mean speedup) and for the desynchronization (thus, the mean memory request 
per processor). It is established that the memory request per processor, required for 
state savings, does not grow without limit for a finite number of processors and a 
finite load per processor but varies as the conservative PDES evolve. For a given 
simulation size, there is a theoretical upper bound for the desynchronization and a 
theoretical non-zero lower bound for the utilization. We show that the conservative 
PDES are generally scalable in the ring communication topology. The new approach 
to performance studies, outlined in this chapter, is particularly useful in the search 
for the design of a new-generation of algorithms that would efficiently carry out an 
autonomous or tunable synchronization. 
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I. INTRODUCTION 

Parallel discrete-event simulations (PDES) are a technical tool to uncover 
the dynamics of information-driven complex systems. Their wide range of 
applications in contemporary sciences and technology Q has made them an 
active area of research in recent years. Parallel and distributed simulation 
systems constitute a complex system of their own, whose properties can be 
uncovered with the well-established tools of statistical physics. 

In PDES physical processes are mapped to logical processes (assigned to 
processors). Each logical process manages the state of the assigned physical 
subsystem and progresses in its local virtual time (LVT). The main challenge 
arises because logical processes are not synchronized by a global clock. Conse- 
quently, to preserve causality in PDES the algorithms should incorporate the 
so-called local causality constraint whereby each logical process processes 
the received messages from other processes in non-decreasing time-stamp or- 
der. Depending on the way the local causality constraint is implemented, there 



are two broadly defined classes of update protocols 
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In conservative PDES, 



an algorithm does not allow a logical process to advance its LVT until it is 
certain that no causality violation can occur. In the conservative update pro- 
tocol a logical process may have to wait to ensure that no message with a 
lower time stamp is received later. In optimistic PDES, an algorithm allows a 
logical process to advance its LVT regardless of the possibility of a causality 
error. The optimistic update protocol detects causality errors and provides 
a recovery procedure from the violation of the local causality constraint by 
rolling back the events that have been processed prematurely. 

There are several aspects of PDES algorithms that should be considered in 
systematic efficiency studies. Some important aspects are: the synchroniza- 
tion procedures, the utilization of the parallel environment as measured by the 
fraction of working processors, memory requirements which may be assessed 
by measuring the statistical spread in LVTs (i.e., the desynchronization), inter- 
processor communication handling, scalability as measured by evaluating the 
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performance when the number of computing processors becomes large, and the 
speedup as measured by comparing the performance with sequential simula- 
tions. In routinely performed studies to date, the efficiency is investigated in 
a heuristic fashion by testing the performance of a selected application in a 
chosen PDES environment, i.e., in a parallel simulator. 

Only recently a ne w appro ach in performance studies has been introduced 

in which the properties of the algorithm are 
examined in an abstract way, without a reference to a particular application 
platform. In this approach the main concept is the simulated virtual time 
horizon (VTH) defined as the collection of LVTs of all logical processes. The 
evolution rule of this VTH is defined by the communication topology among 
processors and by the way in which the algorithm handles the advances in 
LVTs. The key assumption here is that the properties of the algorithm are 
encoded in its representative VTH in analogy with the way in which the prop- 
erties of a complex system are encoded in some representative non-equilibrium 
interface. In this way, fundamental properties of the algorithm can be deduced 
by analyzing its corresponding simulated VTH. 

In this chapter we give an overview of how the methods of non-equilibrium 
surface growth (physics of complex systems) can be applied to uncover some 
properties of state update algorithms used in PDES. In particular, we focus on 
the asynchronous conservative PDES algorithm in a ring communication topol- 
ogy, where each processor communicates only with its immediate neighbors. 
The time evolution of its VTH is simulated numerically as an asynchronous 
cellular automaton whose update rule corresponds to the update rule followed 
by this algorithm. The purpose of this study is to uncover generic properties 
of this class of algorithms. 

In modeling of the conservative update mode in PDES, we represent se- 
quential events on processors in terms of their corresponding LVTs. A system 
of processors or processing elements (PE) is represented as a one-dimensional 
grid. The column height that rises above the k-th grid point is a building 
block of the simulated VTH and represents the total time of operations per- 
formed by the k-th processor. These operations can be seen as a sequence 
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of update cycles, where each cycle has two phases. The first phase is the 
processing of the assigned set of discrete events (e.g., spin flipping on the as- 
signed sublattice for dynamic Monte Carlo simulation of lattice systems). This 
phase is followed by a messaging phase that closes the cycle, when a processor 
broadcasts its findings to other processors. But the messages broadcasted by 
other processors may arrive any time during the cycle. Processing related to 
these messages (e.g., memory allocations/deallocations, sorting and/or other 
related operations) are handled by other algorithms that carry their own vir- 
tual times. In fact, in actual simulations, this messaging phase may take an 
enormous amount of time, depending on the hardware configuration and the 
message processing algorithms. In our modeling the time extent of the messag- 
ing phase is ignored as though communications among processors were taking 
place instantaneously. In this sense we model an ideal system of processors. 
The LVT of a cycle represents only the time that logical processes require to 
complete the first phase of a cycle. Therefore, the spread in LVTs represents 
only the desynchronisation that arises due to the asynchronous conservative 
algorithm alone. By the same token, all other performance indicators such as, 
e.g., the overall efficiency or the utilization of the parallel processing environ- 
ment, that are read out of the simulated VTH are the intrinsic properties of 
this algorithm. 

This chapter is organized as follows. The simulation model of asynchronous 
conservative updates and the mapping between the logical processes and the 
physical processes considered in this study are explained in Sec.|Tl| Section UTTl 
outlines the selected ideas taken from non-equilibrium surface science that are 
used in the interpretation of simulation results; in particular, the concepts 
of universality and a non-universal microscopic structure that are relevant 
in deducing algorithmic properties from the simulated VTH. One group of 
these properties includes the utilization and the speedup, which is provided 
in Sec. IIVI Another group includes the desynchronization and the memory 
request per processor, required for past state savings, which is presented in 
Sec. El Performance of the conservative and the optimistic PDES algorithms is 
discussed in Sec. IVII The new approach to performance studies, outlined in this 
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computing system ^ physical system 

logical processes ^ physical processes 



FIG. 1: The mapping between physical processes and logical processes. The nearest- 
neighbor physical interactions on a lattice with periodic boundary conditions (the 
right part) are mapped to the ring communication topology of logical processes 
(two-sided arrows in the left part). Each PE carries a sublattice of N sites. Com- 
munications take place only at border sites. Each PE has at most two effective 
border sites, i.e., neighboring PEs that it communicates with. 

chapter, can be a very convenient design tool in the engineering of algorithms. 
This issue and directions for future research are discussed in Sec. IVIII 

II. SIMULATED VIRTUAL TIME HORIZON 

In simulations a system of L processors is represented as a set of equally 
spaced lattice points k, k = 1,2, ...,L. Each processor performs a number of 
operations and enters a communication phase to exchange information with its 
immediate neighbors. This communication phase, called an update attempt, 
takes no time in our simulations. In this sense we simulate an ideal system of 
processors, as explained in Sec. H] An update attempt is assigned an integer 
index t that has the meaning of a wall-clock time (in arbitrary units, which 
may be thought of as a fixed number of ticks of the CPU clock). The local 
virtual time hi:{t) at the k-th processor site represents the cumulative local 
time of all operations on the k-th processor from the beginning at t = to 
time t. These local processor times are not synchronized by a global clock. 

There is a two-way correspondence between the physical system being sim- 
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ulated in PDES and the system of PEs that are to perform these PDES in a 
manner consistent with and faithful to the underlying stochastic dynamics of 
the physical system, as depicted in FiglH On the one hand, by spatially dis- 
tributing a physical lattice with the nearest-neighbor interactions and periodic 
boundaries among L processors, the asynchronous nature of physical dynamics 
is carried over to the asynchronous nature of logical processes in the ring com- 
munication topology of the computing system. On the other hand, the ring 
communication topology among processors is mapped onto a lattice arrange- 
ment with periodic boundary conditions, hi+kit) = hk(t), and asynchronous 
update events in the system of PEs can be modeled as an asynchronous cellular 
automaton on this lattice. 

The set of local virtual times hk{t) forms the VTH at t (see Fig. 
The time-evolution of the VTH is simulated by an update rule, where lo- 
cal height increments rik{t) are sampled from the Poisson distribution of unit 
mean fip = 1. The form of the deposition rule depends on the processor load 
N, as explained later in this section. 

A general principle that governs the conservative update protocol requires 
a processor to idle if at update attempt t the local causality constraint may 
be violated. This happens when at t the k-th processor does not receive the 
information from its neighboring processor (or processors) if such information 
is required to proceed in its computation. This corresponds to a situation 
when the local virtual time hk{t) of the k-th processor is ahead of either one 
of the local virtual times hk_i{t) or hk+i{t) of its left and right neighbors, 
respectively. In this unsuccessful update attempt the local virtual time hi:{t) 
is not incremented, i.e., the kth processor waits: hk{t + 1) = hk{t). In another 
case, for example, when at t the kth processor does not need information from 
its neighbors it performs an update regardless of the relation between its local 
virtual time and the local virtual times on neighboring processors. At every 
successful update attempt, the simulated local virtual time at the k-th PE- 
site is incremented for the next update attempt: h^^t + 1) = hk(t) + rj^it), 
where rj^it) = — In(rjtf), and r^t € (0; 1] is a uniform random deviate. The 
simulations start from the fiat-substrate condition at t = 0: hk{0) = 0. 
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One example of computations that follow the above model is a dynamic 
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In a parallel environ- 



Monte Carlo simulation for Ising spins 
ment, a spin lattice is spatially distributed among L processors in such a way 
that each processor carries an equal load of one contiguous sublattice that 
consists of spin sites (i.e., each processor has a load of volumes). Some 
of these spin-lattice sites belong to border slices, i.e., at least one of their 
immediate neighbors resides on the sublattice of a neighboring processor. Pro- 
cessors perform concurrent spin- flip operations (i.e., increment their LVTs) as 
long as a randomly selected spin-site is not a border site. If a border spin-site 
is selected, to perform a state update a processor needs to know the current 
spin-state of the corresponding border slice of its neighbor. If this information 
is not available at the t update attempt (because the neighbor's local time is 
behind), by the conservative update rule the processor waits until this infor- 
mation becomes available, i.e., until the neighbor's local virtual time catches 
up with or passes its own local virtual time. 

The least favorable parallelization is when each processor carries the min- 
imal load of = 1. Computationally, this system can be identified with a 
closed spin chain where each processor carries one spin-site. At each update 
attempt each processor must compare its LVT with the local times on both 
of its neighbors. The second least favorable arrangement is when N = 2. As 
before, at each update attempt every processor must compare its local time 
with the local time of one of its neighbors. When A^ > 3, at update attempt 
t, the comparison of the local virtual times between neighbors is required only 
if the randomly selected volume site is from a border slice. 

The above three cases are realized in simulations by the following three 
update rules. When A^ = 1, the update attempt at t is successful iff 

hk(t) < mm{hk^i(t),hk+i(t)} . (1) 

When A^ = 2, at any site k where the update attempt was successful at (t — l), 
at t we first randomly select a neighbor (left or right). This is equivalent to 
selecting either the left or the right border slice on the kth processor. The 
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FIG. 2: The growth and roughening of the simulated VTH interface: snapshots at 
ti and at a later time t2- Local heights are in arbitrary units. Here, L = 100 and 
= 1. 

update attempt is successful iff 

hkit) < Kit), (2) 

where n is tlie randomly selected neighbor {n = k — 1 for the left, n = k + 1 
for the right). At any site k where the update attempt was not successful at 
(t — 1), at t we keep the last n value. When > 3, at any site k where the 
update attempt was successful at (t — 1), at t we first randomly select any 
of the volume sites (indexed by Uk) assigned to a processor. The selected 
site can be either from the border sites (either = 1 or = A^) or from the 
interior. The attempt is successful if the selected site is the interior site. When 
the border site is selected, the attempt is successful if condition is satisfied. 
As for A^ = 2, at any site k where the update attempt was not successful at 
(t — 1), at t we keep the last value. 

In this way the simulated VTH, corresponding to the conservative update 
rule followed by the PDES algorithm, emerges as a one- dimensional non- 
equilibrium surface grown by depositions of Poisson-random time increments 
that model waiting times. Two sample VTH surfaces are presented in Fig. |21 
Major properties of the corresponding algorithm are encoded in these inter- 
faces. In principle, with the help of statistical physics, one should be able to 
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obtain from VTH such properties as the utihzation, the ideal speedup, the 
desynchronization, the memory request per processor, the overall efficiency 
and the scalability. One basic property is the mean utilization {u{t)), which 
can be assessed as the fraction of sites in the VTH interface that performed 
an update at t, averaged over many independent simulations. For the minimal 
load per processor = 1, (^(t)) is simply the mean density of local minima 
of the interface (Fig. |21). Another basic property is the mean desynchroniza- 
tion {w(t)) in operation times, which can be estimated from simulations as 
the mean statistical spread (roughness) of the VTH interface. In the following 
section we review some useful concepts from surface physics relevant to our 
study. 



III. THE (1 + 1) DIMENSIONAL NONEQUILIBRIUM INTERFACES 



The roughness of a surface that grows on a one dimensional substrate of L 
sites can be expressed by its interface width w{t) at time t 

(^^t)) = (^li^ihkit) -m^y (3) 

where hk{t) is the height of the column at site k and h{t) is the average height 
over L sites, h(t) = (l/L) J2k=i ^k(t). The angular brackets denote the average 
over many interface configurations that are obtained in many independent 
simulations. In our study these configurational averages were computed over 
800 simulations, unless noted otherwise. 

Based on the time-evolution of w{t), interfaces can be classified in various 
universality classes (for an overview see Ref. 2J]). The idea behind the concept 
of universality is that, in a statistical description, the growth of the surface 
depends only on the underlying mechanism that generates correlations among 
time-evolving columns hk{t) and not on the physical particulars of physical 
(or other) interactions that cause the growth. For instance, two completely 
different physical interactions among deposited constituents (e.g., one of a 
magnetic nature and the other of a social nature) may generate two equivalent 
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surfaces of one universality class, depending on the observed evolution of w(t). 
The simplest case of surface growth is random deposition (RD), when the 
column heights hk{t) grow independently of each other. The RD interface 
is totally uncorrelated. The time-evolution of its width is characterized by a 
never-ending growth in accordance to the power law w{t) ~ t^, with the growth 
exponent (3 = 1/2. Such growth defines the RD universality class. 

The self-affined roughness of the interface manifests itself by the existence 
of Family- Vicsek scaling : 

n^\t) = L^V , (4) 

where the scaling function f{y) describes two regimes of the width evolution: 

fiy)-{ (5) 

I const. , y ^ 1. 

The dynamic exponent z gives the time-evolution of the lateral correlation 
length i{t) ~ t^l\ i.e., at a given t the largest distance along the substrate 
between two correlated columns. When ^(t) exceeds the system size L the 
width saturates and does not grow any more. At saturation, for t ^ t^, for a 
given L the width remains constant and obeys the power law w ^ L'^, where 
a is the roughness exponent. The growth phase is the initial phase for t <^t^ 
before the cross- over time t X ~ L"^ at which saturation sets in. The growth 
phase is characterized by the single growth exponent (3 = a/ z. The roughness, 
growth and dynamic exponents are universal, i.e., their values depend only on 
the underlying mechanism that generates correlations. 

A simple continuum model of non-equilibrium growth that leads to scaling 
is provided by the Kardar-Parisi-Zhang (KPZ) equation [2^: 

ht = v{t) + ph^^ + ^hl + C, (6) 

where h = h{x, t) is the height field (subscripts denote partial derivatives), v{t) 
is the mean interface velocity, v{t) = {dh{t)/dt), and C{x,t) is the uncorrelated 
Gaussian noise. The coefficients u and A give the strength of the linear damping 
and the coupling with the nonlinear growth, respectively. A renormalization 
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group analysis 



2J, |26[ can provide a connection between the stochastic growth 
equation and scahng exponents. 

The KPZ universality class, governed by the dynamics of Eq.(jni), is charac- 
terized by a = 1/2 and /5 = 1/3, and the exponent identity a + z = 2. When 
A = in Eg .(1^ 1, t he growth is governed by the hnear Edwards- Wilkinson 
(EW) equation j2_7|. The EW universality class is characterized by a = 1/2 
and (3 = 1/4, and the EW exponent identity is 2a + 1 = ;z. When A = and 
z/ = in Eq.®, the growth belongs to the RD universality class. Unlike the 
KPZ and EW interfaces, the RD interface is not self-affined. 

The origins of scale invariance, as in Eq.(0}, and the universal properties 
of time-evolving surfaces are well understood 24]. In this study, we use the 
universal properties of the simulated VTH to investigate the scalability of the 
corresponding PDES algorithm. However, there are many instances where 
non- universal properties, i.e., those pertaining to the microscopic structure of 
the interface, are of importance. In this study, one example is the density of 
local minima or the density of update sites of the VTH interface. It is safe to 
say that there is no general silver-bullet-type of approach to these problems. 
For the case study of VTH interfaces, we were able to develop a discrete-event 

culating a probability distri- 



For the closed linear 



analytic technique that provides a means for ca 
bution for events that take place on the surface 
chain of L processors carrying minimal load, when the VTH is simulated by 
Poisson-random depositions at local interface minima, the probability distri- 



bution P{p; L) of the update events on the corresponding VTH surface is 17[: 



2L-2 y2p - 1^ 

where p is the number of updates at the t-th update attempt after the simu- 
lations reach a steady state. This distribution can be used to derive approx- 
imate closed formulas for mean quantities measured in simulations, e.g., for 
{u{t)). Also, it is a starting point in the derivation of analogous distributions 
P{p; L;N) for the cases when each processor carries a larger load of = 2 or 
> 3 [28|. The advantage of knowing P{p; L; N) is that it enables one to com- 
pute analytically quantities that otherwise can be only estimated qualitatively 
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FIG. 3: Simulated time evolutions of characteristic densities and the scaled VTH 
interface velocity during simulations with the minimal load = 1 per processor 
{L = 1000). Time to marks the transition to steady-state simulations and tx is the 
saturation time, as explained in the text. For times later than to, both the utilization 
{u) (diamonds) and the velocity v (filled circles) are constant. 



in simulations. 



IV. UTILIZATION OF THE PARALLEL PROCESSING 

ENVIRONMENT 

For a real system, the mean utilization {u{t)) is defined either as the number 
or as the fraction of processors that on average work simultaneously at a time. 
In our model, {u{t)) is the mean fraction (i.e., the density) of update sites 
in the simulated VTH. The simulated time evolution of {u{t)) is presented in 
Fig. El (for = 1) and in Fig. HI (for = 100) that illustrate the following ob- 
servations. First, {u(t)) is not constant as the simulations evolve but abruptly 
decreases from its initial value at t = and very quickly, after a few hundred 
steps, settles down at its steady value when it no longer depends on time. Both 
the utilization and the transition period to to the steady state depend strongly 
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FIG. 4: Simulated time evolutions of characteristic densities and the scaled VTH 
interface velocity during simulations with the load N = 100 per processor (L = 
1000). As in Fig. 121 for times later than to both the utilization (u) (diamonds) and 
the velocity v (filled circles) are constant, but their values are significantly higher 
than in the worst-case performance scenario presented in Fig. |3I 



on the processor load A^. Second, the VTH velocity v(t) must be related to 
{u{t)) by a simple linear scaling relation. Third, the transition time to to the 
steady state can be estimated from simple statistics of the interface. Let us 
elaborate on these issues. 

The characteristic densities (/>(t)) and (/<(t)), plotted in Figs. OHH are 
the fractions of the interface sites (processors) that have their LVT larger and 
equal-or-smaller, respectively, than the mean virtual time h{t). Their relation 
to each other is a simple indicator of the skewness of the distribution of the 
LVTs about the mean virtual time. For the times when they approximately 
coincide this distribution is approximately symmetric. The reason for a non- 
zero skewness for the early times t < to is Ihe flat-substrate initial condition, 
i.e., the initial null LVT on all processors (the detailed analysis of this issue 
can be found in 3]). In the worst-case performance scenario, when each 
processor has the minimal load of iV = 1 (Fig. the duration of this initial 



14 





0.33 - 




0.32 - 




0.31- 




0.3- 


II 


0.29- 


:L; N: 


0.28- 


=3 0.27 - 
V 




0.26- 




0.25- 




0.24- 




0.23 - 



1 1 — I — I I I I I 



— Theory 

-- Asymptotic limit 

O Simulation data 




-I I I 



10 



100 



1000 



FIG. 5: The steady-state mean utilization vs the number of processors L for the 
minimal load per processor: the analytical result (continuous curve), its asymptotic 
limit liuiL^ooiuiL; 1)) = 1/4 (horizontal line) and simulation results (symbols). The 
error bars are smaller than the symbol size. 



transition time to to the steady state is a non-universal property of the VTH 
interface. This means that in real applications the time to will depend on the 
application platform, i.e., the hardware configuration and parameters. But, for 
a real application to can be determined in a non-expensive way by monitoring 
(/>(t)) and (/<(t)) in a trial simulation with a fixed N = Nq. Then, the results 
can be scaled either up or down for an arbitrary load N. The existence of such 
scaling is the universal property of the VTH interface, which is discussed in 
SecfVl where the explicit scaling relations are given. 

The overall progress of PDES can be estimated from the time rate of the 
global virtual time (GVT) that is the smallest LVT from all processors at t. 
The GVT determines the fossil collection, i.e., the memory that can be re/de- 
allocated from past-saved events. In our model, the mean GVT is the mean 
global minimum {hcvTif)) of the simulated VTH: hcvrit) = mm{hkit) : 1 < 
k < L}. On the average, the time rate of (/iGvr(t)) is not larger than the VTH 
interface velocity v{t) and after cross-over time tx to saturation (indicated in 
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FIG. 6: The steady-state mean utilization vs the number of processors L for loads 
N > 2 per processor: analytical results (continuous curves) and simulation results 
(symbols). The error bars are smaller than the symbol size. 

Figs. OHH and Figs. IHHTUj) these two rates are equal. Thus, v(t) is the measure 
of progress. It is shown analytically [29] that v{t) = (u(t))/i, where is a 
constant. For the simulated VTH /i = /xp = 1, as can be seen in Figs. OHll In 
a real application /i is a hardware dependent parameter. 

Since the overall progress in PDES connects linearly to {u(t)), the mean 
utilization is the most important property of this algorithm. The steady-state 
utilization {u{L] N = 1)) for the worst-case performance as a function of the 
number L of processors is presented in Fig. |S1 For brevity of notation, from 
now on we omit the index t because for t > to the utilization is constant. 
Considering the simulation data alone (symbols in Fig. Ej) it is easily seen that 
even at the infinite limit of L the mean utilization has a non-zero value. In 
fact, this asymptotic value is approached very closely with less than a thousand 
processors. The existence of this limit guarantees a non-zero progress of PDES 
for any value of L and for any processor load N, since the curve {u{L; N = 1)) 
is the lower bound for the steady-state utilization {u{L] N)) (compare with 
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Fig. ED- 

The presence of this non-zero hmit for 1)) as L ^ oo and the global 
behavior of the simulation data, observed in Fig. El suggest the existence of 
some underlying scaling law for {u{L] 1)). It appears that, indeed, the under- 
lying scaling law can be obtained analytically from the first moment of the 
probability distribution P(p; L) given by Eq.Q and the result is [17 1 

M^>3;1)) = ^, (8) 

and {u{L = 2; 1)) = 1/2. This function is plotted in Fig.^lfor L > 3. Similarly 
for any N > 2, starting with P{p] L), one can construct the probability distri- 
bution P{p; L; N) of updates in the system of L processors, each having a load 
of (i.e., the distribution of updates on the corresponding VTH interface). 
The mean uti ization {u{L; N)) can then be obtained from the first moment 
of Pip; L;N) 3: 

(.(L>3;iV>2))^ri-i(^Vl-^^y P) 



4 L 



where g(A^) = ^/2/N, and 



(uiL = 2; AT > 3)) = 1 - (10) 

and {uiL = 2; N = 2)) = 1/2. Relation (fTUj) is exact. Relation is presented 
in Fig. El As L ^ cxD the asymptotic limit is (m(oo; A^ > 3)) = (2— g)(4— g)/8 > 
(m(oo;A^= 1)) = 1/4. 

The computational speedup s of a parallel algorithm is defined as the ratio 
of the time required to perform a computation in serial processing to the time 
the same computation takes in concurrent processing on L processors. It is 
easy to derive from the above definition that for an ideal system of processors, 
that is for the particular update algorithm considered in this work, the mean 
speedup is 

{s)=L{uiL;N)). (11) 

In other words, (s) is measured by the average number of PEs that work 
concurrently between two successive update attempts. 
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We observe that for ideal PEs the speedup as a function F{L) must be such 
that the equation F{L) = s has a unique solution, where s is a fixed positive 
number. This requirement follows naturally from the logical argument that 
distributing the computations over L ideal PEs gives a unique speedup, i.e., 
two ideal systems having sizes Li and L2, respectively, may not give the same 
s. This means that F{L) must be a monotonically increasing function of L. 

Combining Eq. ()ll|) and Eq.(jH)), in the worst-case performance when the 
load per processor is minimal, the mean speedup is {s{L; N = 1)) = {L + l)/4 
and {s{L = 2,3; N = 1)) = 1. The latter relation says that this algorithm 
produces no speedup when the computations are distributed over 2 or 3 pro- 
cessors. In this case, although the utilization is 1/2, the processors do not 
work concurrently but alternately, i.e., one is working while the other is idling. 
For a real system of PEs performing PDES, in such a situation the communi- 
cation overhead will produce an actual slowdown, i.e., the parallel execution 
time will be longer than the sequential execution time on one processor. When 
L > 4, to take an advantage of concurrent processing the average number of 
PEs working in parallel between two successive update attempts must satisfy 
(L + l)/4 > 2, which gives L > 7. Still, depending on the implementation 
platform, the actual speedup of PDES may be negligible or not present at all 
for small L. 

For a general load per processor, combining Eq. ljllj) and Eq.Q gives a 
linear relation with respect to L: 

Equation ()12|) can be rearranged to 

where q{N) = 1 — q{N) is the probability that a processor performs an up- 
date without the need of communicating with other processors {2^ • This 
probability sharply increases with the processor load. Since the mean speedup 
increases quadratically with q{N) and only linearly with L, for some PDES that 
perform the updates in accordance to this algorithm it may be more advanta- 
geous to assign more load per processor than to distribute the computations 



{s{L >3;N> 2)) 



(12) 
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FIG. 7: Simulated time evolution of the VTH width in the worst-case performance 
scenario = 1. Time to, common for all L, marks the transition to the steady-state 
simulations of Fig. |31 

over a large number of processors. In any case, either of the above relations, 
Eq. lfT^ or (fT!^ . can be used to assess the upper bound for the speedup in 
actual apphcations. 



V. DYNAMICS OF DESYNCHRONIZATION 

In PDES the memory request per processor, required for past-state sav- 
ings, depends on the extent to which processors get desynchronized during 
simulations. In our model, the statistical spread w(t) of the simulated VTH, 
as illustrated in Fig. |2l provides the measure of this desynchronization. In 
simulations the width of the VTH interface is computed using Eq. (jHl). The 
representative results of simulations for the case of the minimal load per proces- 
sor are presented in Fig. [7| For any number L of processors the time evolution 
of the VTH width has two phases. The first phase, for t -C tx, is the growth 
regime, where for t > Iq the width w{t) follows a power law in t with the 
growth exponent [3 = 1/3. The second phase, after the cross-over time tx, 
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FIG. 8: The scaled time evolution of the simulated VTH widths of Fig. [7| for all 
times t > 0. The insert shows the full data collapse for t > to, with the growth 
exponent f3 = 1/3. 

is the saturation regime, where w{t) has a constant value that depends only 
on the system size and follows a power law in L with the roughness expo- 
nent a = 1/2. The values of these exponents are characteristic of the KPZ 
universality class. Explicitly, the evolution can be written as 



'2a 



(14) 



where to is the initial regime where the Family- Vicsek scaling law, Eqs.(0HSI), 
does not hold. This can be seen directly when the scaling is performed for 
all t > 0, as illustrated in Fig. |H1 The whisker-like structures that appear 
after data collapse in the growth phase, clearly observed in Fig. |H1 indicate 
that in the initial start-up time t < the curves in Fig. [7| before scaling 
follow one evolution for all L. The insert shows the universal Family- Vicsek 
scaling function, Eq.(jni), for t > to- Here, the cross-over time scales as tx ~ 
L^, where z = a/ (3 = 3/2. The presence of the initial non-scaling growth 
regime is an artifact of the flat-substrate initial condition. Its duration to is 
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FIG. 9: Simulated time evolution of the VTH width for loads > 1. There are two 
growth regimes, characterized by two exponents (i\ and /32- The duration of the early 
phase depends on N . In this early phase, simulations are not in the steady-state: 
the squared width increases linearly with time. 



a non-universal parameter that can be determined in PDES by monitoring 
characteristic densities and the utilization, as discussed in Sec. IIVI 

When the simulations are performed for the case when each processor carries 
a load A^, the evolution of the VTH width changes. Now, as illustrated in Fig. El 
for L = 1000, there are two distinct phases in the initial growth regime. The 
early phase evolves in the RD fashion, having the growth exponent /?o = 1/2, 
and the later phase has signatures of the KPZ scaling: 



J2a 



t < toiN) 

to(iV) <t <tx(A^) 



(15) 



where both and depend on the processor load. The initial RD-like 
growth does not scale with L. This lack of scaling extends to approximately 
to{N) oc tQN/2, where to marks the end of the initial relaxation period when 
= 1 or = 2. The physical justification for the presence of the RD 
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FIG. 10: The scaled time evolution of the simulated VTH widths for general values 
of A'^ and L. This scaling function is valid only for steady-state simulations. In the 
scaling regime the growth exponent is /? = 1/3, as in Fig.|H| The time to saturation, 
when the width is constant, depends on both and L. 



growth is that when > 3 there is a non-zero probabihty of having some 
processors performing state updates without the need of communicating with 
other processors. This probabihty of "uncorrelated" updates increases when 
the processor load increases. However, even for a large but finite there 
are some processors that may not complete an update without communica- 
tion with another processor, thus, correlations are build among the processors 
and propagate throughout the system. Eventually these "correlated" updates 
cause the VTH interface to saturate. The net effect of having a large load per 
processor is the noticeable elongation of the time scale over which the corre- 
lations are build, but the dynamics of building these correlations belongs to 
the same universality class as in the case of the minimal load per processor. 
Therefore, it is expected that the simulated VTH should exhibit KPZ univer- 
sality in this well, as soon as the correlations become apparent. Indeed, 
after the initial transition time to(^) the VTH widths can be collapsed onto 
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the following scaling function [2 



2 V^^^ 



) 



(16) 



where f{y) satisfies Eq.(0), and z = 2 — a with a ~ 1/2. This scaling function 
is presented in Fig. ^| Accordingly, the VTH interfaces belong to the KPZ 
universality class. In the scaling regime, the evolution can be written explicitly 

as 



where ty,{N) ~ {N/2)L', /3 = 1/3, a = 1/2 and z = 3/2. For < t < to{N) ~ 
toN/2, for all L the width follows the power low {w{t)) ~ t*, where (3o = 1/2. 



The consequence of Eq. fll7|) is that the memory request per processor does 
not grow without limit but varies as the computations evolve. The fastest 
growth characterizes the initial start-up phase. The length of the start-up 
phase depends on the load per processor. The start-up phase is characterized 
by decreasing values of the utilization. In the steady-state simulations, when 
the utilization has already stabilized at a mean constant value the memory 
request grows slower, at a decreasing rate ~ l/t^l"^ . In this phase, the mean 
request can be estimated globally from Eq. lfTTj) . The important consequence 
of scaling, expressed by Eq. lfTHjl . is the existence of the upper bound for the 
memory request for any finite number of processors and for any load per pro- 
cessor. On the average, this upper bound increases proportionally to yiVL 
with the size of conservative PDES. 

The characteristic time scale to{N) from the first step to the steady-state 
simulations can be estimated by monitoring the utilization for the minimal 
processor load (to determine to) and, subsequently, scaling this time with A^. 
Similarly, the characteristic time scale to tx{N), when the desynchronization 
reaches its steady state, can be scaled with the processor load to determine an 
approximate number of simulation steps to the point when the mean memory 
request does not grow anymore. 




to(iV) <t <tx(A^) 
t >tx(A^), 



(17) 
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VI. CONSERVATIVE VS OPTIMISTIC 

While the conservative algorithm strictly avoids the violation of the local 
causality constraint, the optimistic algorithm may process the events that are 
out of the time-stamp order which might violate this constraint. At times 
when the conservative algorithm forces the processors to idle the optimistic al- 
gorithm enforces the computations and state-updates, thus, according to our 
adopted definition, the theoretical utilization of the optimistic update scheme 
is always at its maximum value of one because the processors never idle. How- 
ever, some of the events in the thus processed stream of events on a processor 
must be processed prematurely, judging by the random nature of the opti- 
mistic scheme that takes a risk of guessing whenever the next event is not 
certain. When in the course of an update cycle a processor receives a straggler 
message, i.e., a message "from the past" that has its time-stamp smaller than 
the clock value, all the later events that have been processed incorrectly must 
be cancelled. The processor must then send out cancellation messages (called 
anti-messages) to other processors to roll back all the events that have been 
processed prematurely. Thus, in the optimistic update scheme, although the 
processors never idle, the computation time of the update cycle is not utilized 
fully efficiently because some part of this time is used for the meaningless op- 
erations (i.e., creation and processing of the rollbacks) and only part of a cycle 
represents the computations that assure the progress of PDES. 

There are many variations of optimistic update schemes, e.g., Refs. 



10,|ll| and references in oriented to building implementations with better 
efficiencies and memory management. The key feature of the update mecha- 
nism, as described above, and main concepts such as rollback and GVT, first 
introduced in Jefferson's Time Warp [3], can be treated as the generic proper- 
ties of the optimistic algorithm. In its generic form, the algorithm keeps the 
already processed events in the memory in case of the necessary re-processing 
required by the rollbacks. 

For the ring communication topology, we simulate the growth of the VTH 
corresponding to the optimistic update procedure as Poison-random deposi- 
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tions to the lattice sites in analogy to the model described in Sec. |TT1 However, 
in the optimistic model the deposition rule is modified to mimic key features 
of the optimistic algorithm. We assume that each update cycle on each pro- 
cessor consists of processing Nc events, only some of which can be eventually 
committed. With each of these events there is the associated random time 
increment t]. Now the integer index t represents the update cycle. The main 
difference between the optimistic and the conservative simulation models is 
that any time the conservative would wait the optimistic is allowed to per- 
form a random guess. In the simplest case of totally unbiased guesses, in each 
cycle the number of correctly guessed events is obtained from a uniform dis- 
tribution. The cumulative simulated LVTs that correspond to processing all 
events form the simulated optimistic VTH. The cumulative simulated LVTs 
that correspond to processing only the correctly guessed events form the simu- 
lated progress VTH, which is embedded in the optimistic VTH. The difference 
between the optimistic VTH and the progress VTH represents the cumulative 
time that has been wasted by generating and processing erroneously guessed 
events and their associated rollback operations. We define the overall efficiency 
of the optimistic algorithm as the ratio of the total progress time to the total 
computation time. At t, the total progress time and the total computation 
time are obtained by integrating the progress VTH and the optimistic VTH, 
respectively, and are represented by the areas under these VTH interfaces. In 
analogy with the above definition, we define the overall efficiency of the con- 
servative algorithm as the ratio of the total computation time (i.e., the area 
under the conservative VTH) to the total time that the processors spend on 
computations and idling. These efficiencies are presented in Fig. for the 
worst-case scenario of the minimal load per processor and Nc = 1. 

Our simulations (Fig. lTT|) confirm the common conception that, in the ideal 
setting, the optimistic algorithm should outperform the conservative algorithm 
Q]. As Fig. 111! shows, the lower bound for the steady-state conservative ef- 
ficiency is about 25% and coincides with the lower bound obtained for the 
utilization (Fig. El). For the same case, the steady-state optimistic efficiency 
is about 62%; accordingly, the optimistic algorithm has a better utilization of 
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FIG. 11: Simulated time-evolution of overall efficiencies in an optimistic (upper 
curve) and a conservative (lower curve) PDES when processors carry minimal loads 
(L = 10,000). 



the parallelism. In actual applications the conservative efficiency can be im- 
proved by exploiting in programming a concept of lookahead, based on actual 
properties of the distributed PDES physical model under consideration. 

The statistical spread of the simulated optimistic VTH is presented in 
Fig. El (note, this figure presents the results obtained in only one simulation). 
The simulated optimistic VTH belongs to the RD universality class and the 
spread in local virtual times grows without limit in accordance to the power 
law w(t) ~ ^/t. Intuitively, this result should be expected because, by analyz- 
ing the operation mode of optimistic updates, one notices that the processors 
work totally independently, progressing their LVTs in an uncorrelated fashion. 
Thus, it should be expected that the memory request per processor required 
to execute generic optimistic PDES grows without bounds as ^/i when the 
simulations are progressing in t. 

The unboundedness of the memory request in a generic optimistic scheme 
can be also justified using quite different arguments however, the power-law 
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t 

FIG. 12: Simulated time-evolution of desynchronizations in an optimistic PDES 
when processors carry minimal loads {L = 10,000): the widths of the optimistic 
VTH (upper curve, "plus" symbol) and of the progress VTH (lower curve, dia- 
monds) . 

growth for this request, illustrated in Fig. El has been never reported before. 
The adverse ways in which such an unbounded desynchronization affects the 
performance and standard remedies that can be taken to improve on the use 
of computing resources by optimizing the optimistic memory management, are 
well-known issues (a comprehensive discussion can be found in Ref. [3|). In 
general, the generic optimistic update scheme requires some kind of explicit 
synchronization procedure that would limit the lengths of rollbacks. 

The actual performance of the PDES application may depend on the par- 
ticulars of the underlying physical dynamics of the physical system being simu- 
lated and the best choice of the algorithm may be uncertain in advance without 
some heuristic trial studies. For example, recently, Overeiner et al. 0, |^| 
reported the first observation of self-organized critical behavior in the per- 
formance of the optimistic Time Warp algorithm when applied to dynamic 
Monte Carlo simulations for Ising spins on a regular two-dimensional lattice. 
They found that when this PDES approaches a point when the physical Ising- 
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spin phase transition is being simulated (i.e., the critical point of the physical 
dynamics), the average rollback length increases dramatically and simulation 
runtimes increase nonlinearly. In Ising-spins simulations, increases in roll- 
back lengths are to be expected since around the Ising critical temperature 
the physical system is characterized by the presence of long-range spin-spin 
correlations and collective behavior, where large-scale spin-domains may be 
overturned simultaneously. Consequently, approaching the critical point of 
physical dynamics should produce a decreased number of committed events. 
However, the simultaneous nonlinear increase of the simulation runtime when 
PDES approaches this critical point seems to be a property of the optimistic 
algorithm since a similar problem is not observed when the same physical sys- 



tem is being simulated using the conservative algorithm [2^. One possible 
explanation for this nonlinear deterioration in runtime may be a nonlinear 
cache behavior when rollback lengths increase beyond a certain critical value 
and memory requests increase. The role of the cache behavior, in particu- 
lar, the nonlinear performance degradation with the number of cache misses, 
has been recently discussed in regard with the efficient implementation of an 
asynchronous conservative protocol for a different phy sical system [s^. An- 
other possible explanation, as conjectured in Ref. 30|, may be the onset of 
self-organized criticality in the Time Warp simulation systems, unrelated to 
the physical critical state. Further studies are required to extract universal 
properties of optimistic protocols to identify a class of simulation problems 
that would show in computations a similar behavior to that encountered in 
PDES for Ising spins. 

VII. CONCLUSION 

The performance of the distributed PDES algorithms depends in general 
on three main factors: the partitioning of the simulated system among pro- 
cessors, the communication topology among the processors, and the update 
protocol b eing adopted. In a heuristic approach to performance studies (e.g., 
as in Ref. j22,l2l|) the application algorithm often utilizes physical properties 
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of the model to be simulated; thus, the conclusions of such studies, as being 
application specific, may have a limited scope of generalization. In this chap- 
ter, we presented a new way to study the performance of PDES algorithms, 
which makes no explicit reference to any particular application. In this new 
approach, the system of processors that perform concurrent update operations 
in a chosen communication topology is seen as a complex system of statistical 
physics. First, based on the update and the communication patterns of the 
algorithm, we construct a simulation model for its representative virtual time 
horizon. The statistical properties of this virtual time interface correspond to 
the properties of the algorithm, that is, to the properties of the pattern in which 
the correlations are formed and propagate in the computing system. Second, 
we extract the properties of the algorithm from the statistical properties of its 
simulated virtual time horizon. 

In this chapter, we demonstrated how this approach can be used to elu- 
cidate the key generic properties of the asynchronous conservative parallel 
update protocol in the ring communication topology among processors. For 
a finite PDES size, i.e., a finite load per processor and a finite number L 
of processors, our findings can be summarized as follows. Both the utiliza- 
tion of the parallel processing environment and its desynchronization can be 
derived explicitly as theoretical functions of L and (these are Eqs.(j8l fT0|l 
and Eqs. ()15llf7jl . respectively). These functions express the existence of the 
underlying scaling laws for the corresponding virtual time horizon and are 
understood as approximate relations in the sense of statistical averages. The 
existence of these scaling laws presents one aspect of scalability of this type of 
PDES algorithm. The other aspect of algorithmic scalability is the behavior 
of these functions when and L increase. In the limit of large L there is a 
theoretical non-zero lower bound for the utilization, for any A^, and the value 
of this bound increases with A^. On the other hand, for any A^ and L there 
is a finite upper bound for the desynchronization, thus, for the mean memory 
request per processor during steady-state simulations. Therefore, this kind of 
conservative PDES algorithm is generally scalable. 

The model simulation of the virtual time horizon for the generic optimistic 
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update protocol in the ring communication topology (Sec. IVI|) showed that 
the optimistic algorithm lacks the algorithmic scalability. As the optimistic 
simulations evolve in the steady state, the width of the optimistic virtual time 
horizon grows without limit for any finite PDES system. In other words, even 
for the minimal load per processor, the memory request per processor ever 
increases as the square root of the performed number of time-stamped update 
cycles, as the simulations evolve in time. Therefore, the generic version of this 
algorithm demands some form of explicit periodic synchronization. 

One advantage of studying the PDES algorithms in terms of their corre- 
sponding virtual time interfaces is the possibility of deriving explicit diagnostic 
formulas for the performance evaluation, such as, e.g., the evaluation of the 
speedup given by Eqs. ()12llf3jl or the estimate of the memory request given by 
Eq. ()17p for the conservative algorithm considered in this study. These theoret- 
ical estimates should be treated as the ideal upper bounds for the performance 
when PDES are implemented on the real computing systems. A real imple- 
mentation will produce a deviation from the theoretical prediction, depending 
on the computing platform and on other components of simulation algorithms. 
The extend to which the performance of the implementation scales down from 
the ideal performance should provide important information about possible 
bottlenecks of the real implementation and should be a guide to improving the 
efficiencies. 

The other benefit that comes from the modeling of virtual-time interfaces 
is a relatively inexpensive design tool for new-generation algorithms, without 
a prior need for heuristic studies. For example, knowing that in the ring com- 
munication topology the maximal conservative memory request per processor 
for past state savings gets larger as the simulation model gets larger, it is easy 
to predict the maximum model size that would fit the available memory in the 
system. However, the available memory resources vary across implementation 
platforms, so it may happen that one size simulation model may fit on one 
platform and may be too large for the other, having the same number of avail- 
able processors. The question then is: how to modify the update algorithm to 
allow for the tunable memory request? Obviously, the question concerns the 
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control of the extent to which the processors get desynchronized in the course 
of simulations, i.e., the control of the VTH width. One can think about a 
suitable update pattern that would model the virtual-time interface of the de- 
sired properties and then translate this pattern to a new update procedure of 
the modified algorithm. This approach has been used to design a constrained 
conservative update algorithm |16|, where the desynchronization is controlled 
by the width of a moving virtual update window and the ring communication 
topology is modified to accommodate multiple connections between a proces- 
sor that carries GVT at given update attempt and other processors. In an- 
other group of conservative algorithms an implicit autonomous synchronization 
may be achieved by modifying the ring communication pattern to accommo- 
date connections with the build-in small-world type of communication network 



19|. In both of these modifications, the additional communication network 
imposed on the original ring communication topology serves the sole purpose 
of reducing the desjTichronization. Further studies are required in this matter 
to identify best efficient ways of tuning the desynchronization and the memory 
request. 

In summary, the new approach to performance studies, outlined in this 
chapter, that utilizes simulation modeling of virtual-time interfaces as a tool 
in algorithm design, opens new interdisciplinary research methodologies in the 
physics of complex systems in application to computer science. Promising 
avenues where this kind of approach to complex systems of computer science 
should lead to useful practical solutions may include the criticality issues in 
distributed PDES algorithms, their scalability, prognostication, the design of 
efficient communication networks as well as the development of new diagnostic 
tools for the evaluation of hardware performance. 
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